This notebook explores the use of dependency graphs to visualize and analyze text data. This is similar to the approach shown by Levy, O., & Goldberg, Y. (2014) in their dependency-based word embeddings paper. However, they don’t convert the dependcies into a full corpus graph. Further literature review needs to explore the work done in this area for potential research gaps.
In this notebook, words are represented as nodes in a directed graph by merging all dependency parsed sentence trees in a corpus. The edges are created based on the dependency relationships between words in the text, and these edges have weights relating to the count of occurances the relationship has. The resulting graph can be analyzed to extract keyword insights and ideally to create word embeddings with a different point-of-view than window based methods.
Keyword extraction is compared to a TextRank approach (Mihalcea, R., & Tarau, P. 2004). With it being an unsupervised process - and due to a lack of familiarity with the data. It is tough to compare these initial results and declare a winner. There are similarities between the results (Pearson \(r \approx 0.72\)), but there are some larger disagreements on some words in the corpus. More work needs to be done to investigate how these high residual tokens are being used in the text.
Node2vec (Grover, A., & Leskovec, J. 2016) embeddings are also explored. The embeddings are projected to 2D with UMAP (McInnes, L., Healy, J., & Melville, J. 2018) for visualization. The embeddings are then used to find similar words and to perform analogies.
The hypothesis for this corpus is that querying for similar vectors will be succesful based on subjective interpretation, but that analogies won’t be as useful given the smaller size of the corpus.
In practice, the node2vec process is prohibitively slow (using the R
implementation by node2vec R package). Future experiments
will look into other node embedding strategies (i.e. graph convolutional
networks - Kipf, T. N., & Welling, M. 2016). As of writing this, it
looks like the process with the current parameters won’t finish by the
assignment’s due date… the back-up plan is to use random walks with walk
length and number of walks that might be too small to be useful. Because
of this, little to no commentary provided on node2vec output.
On my machine, this graph using node2vec with the default 80 walk length is processing at a speed of about 5 walks per hour; I’ve arbitrarily set to 32 walks which will take longer than I have until the deadline.
library(tidyverse)
library(udpipe)
library(igraph)
library(visNetwork)
library(plotly)
library(textrank)
library(node2vec)
library(uwot)
udpipe_file <- "english-ewt-ud-2.5-191206.udpipe"
if (!file.exists(udpipe_file)) {
message("udpipe model not found...")
response <- readline("Would you like to download it? (y/n): ")
if (tolower(response) == "y") {
udpipe_download_model(language = "english", dir = dirnmae(udpipe_file))
} else {
message("ok... fix it yourself then...")
}
}
udmodel <- udpipe_load_model(file = udpipe_file)
# Apply udpipe and store vocab idxs
udpipe_tag <- function(doc_vec, udmodel) {
tagged_doc <- udpipe_annotate(udmodel, x = doc_vec)
tagged_doc_df <- as.data.frame(tagged_doc)
# Adding vocab idx
vocab_df <- tagged_doc_df |>
select(lemma) |>
group_by(lemma) |>
summarise(n = n()) |>
mutate(vocab_token_id = 1:n())
tagged_doc_df <- vocab_df |>
select(-n) |>
right_join(tagged_doc_df, by = "lemma")
list(tagged_doc_df = tagged_doc_df, vocab_df = vocab_df)
}
# Accept output from udpipe_tag and convert to nodes_df, edges_df, and igraph object
doc2graph <- function(tagged_doc_df, vocab_df, directed = TRUE, node_size_range = c(5, 30)) {
edges_df <- tagged_doc_df |>
left_join(tagged_doc_df, by = c("head_token_id" = "token_id", "sentence_id", "doc_id")) |>
select(from = vocab_token_id.y, to = vocab_token_id.x)
if (!directed) {
edges_df <- edges_df |>
select(to = from, from = to) |>
bind_rows(edges_df)
}
edges_df <- edges_df |>
filter(!is.na(from), !is.na(to)) |>
group_by(from, to) |>
summarise(weight = n(), .groups = "drop") |>
ungroup()
node_degree <- edges_df |>
pivot_longer(cols = c(from, to), values_to = "id") |>
count(id, name = "degree")
if (!directed) {
node_degree <- node_degree |>
mutate(degree = degree / 2)
}
# Merge POS into nodes_df
nodes_df <- vocab_df |>
mutate(label = lemma) |>
select(id = vocab_token_id, label, lemma) |>
left_join(tagged_doc_df |> select(lemma, upos), by = c("label" = "lemma")) |>
left_join(node_degree, by = "id") |>
mutate(
size = degree,
font = list(size = 20), # Increase label font size
color = UPOS_COLOR_MAP[upos]
) |>
mutate(size = ifelse(is.na(size), 0, size)) |>
mutate(size = node_size_range[1] + ((node_size_range[2] - node_size_range[1]) * (degree - min(degree)) / (max(degree) - min(degree)))) |>
group_by(id) |>
slice(1) |>
ungroup() |>
mutate(label = paste0(label, " (", upos, ", degree=", degree, ")"))
# Convert nodes_df and edges_df to igraph object
igraph_obj <- graph_from_data_frame(edges_df, vertices = nodes_df, directed = directed)
list(nodes_df = nodes_df, edges_df = edges_df, igraph_obj = igraph_obj)
}
# For coloring by POS
UPOS_TAGS <- c(
"ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ",
"NOUN", "NUM", "PART", "PRON", "PROPN", "PUNCT",
"SCONJ", "SYM", "VERB", "X"
)
UPOS_COLOR_MAP <- scales::hue_pal()(length(UPOS_TAGS))
names(UPOS_COLOR_MAP) <- UPOS_TAGS
UPOS_COLOR_DF <- data.frame(
label = UPOS_TAGS,
color = unname(UPOS_COLOR_MAP)
)
# Plot sentence tree in hierarchical visNetwork
plot_sentence_tree <- function(tagged_sent_df) {
edges_df <- data.frame(
to = tagged_sent_df$token_id,
from = tagged_sent_df$head_token_id
) |>
filter(from != 0)
nodes_df <- data.frame(
id = tagged_sent_df$token_id,
label = paste0(tagged_sent_df$token_id, ": ", tagged_sent_df$token),
color = UPOS_COLOR_MAP[tagged_sent_df$upos],
group = tagged_sent_df$upos
)
legend_items <- UPOS_COLOR_DF |>
filter(label %in% unique(tagged_sent_df$upos))
net <- visNetwork(nodes_df, edges_df)
for (upos in UPOS_TAGS) {
net <- net |>
visGroups(groupname = upos, color = unname(UPOS_COLOR_MAP[upos]))
}
net |>
visLegend(useGroups = FALSE, addNodes = legend_items) |>
visHierarchicalLayout(sortMethod = "directed")
}
# Plot document as graph in visNetwork (no tree layout)
# Slow with large graphs
plot_doc_graph <- function(tagged_doc_df, vocab_df, directed = TRUE) {
nodes_edges <- doc2graph(tagged_doc_df, vocab_df, directed)
legend_items <- UPOS_COLOR_DF |>
filter(label %in% unique(nodes_edges$nodes_df$upos))
net <- visNetwork(nodes_edges$nodes_df, nodes_edges$edges_df)
for (upos in legend_items$label) {
net <- net |>
visGroups(groupname = upos, color = unname(UPOS_COLOR_MAP[upos]))
}
net |>
visLegend(useGroups = FALSE, addNodes = legend_items)
}
# Plot document as graph in igraph
# Use for large graphs (still not useful, but fast to see how unuseful it is)
plot_doc_igraph <- function(tagged_doc_df, vocab_df, directed = TRUE) {
nodes_edges <- doc2graph(tagged_doc_df, vocab_df, directed)
deg <- degree(nodes_edges$igraph_obj)
scaled_size <- scales::rescale(deg, to = c(2, 10))
# only label top % degree nodes
threshold <- quantile(deg, 0.95)
V(nodes_edges$igraph_obj)$label <- ifelse(deg >= threshold, V(nodes_edges$igraph_obj)$label, NA)
legend_entries <- UPOS_COLOR_DF |>
filter(label %in% unique(nodes_edges$nodes_df$upos))
# layout <- layout_with_graphopt(nodes_edges$igraph_obj, charge = 0.05, niter = 1000)
plot(
nodes_edges$igraph_obj,
# layout = layout,
vertex.size = scaled_size,
vertex.label.cex = 0.6,
vertex.label.color = "black",
edge.arrow.size = 0.3,
edge.curved = 0.1,
margin = 0, # REMOVE BIG MARGINS
rescale = TRUE, # Keep this TRUE to fit window
xlim = c(-1, 1),
ylim = c(-1, 1),
asp = 0
)
legend(
"topright",
legend = legend_entries$label,
col = "black",
pt.bg = legend_entries$color,
pch = 21,
pt.cex = 1.5,
cex = 0.6,
bty = "n",
ncol = 1
)
}
patents <- read.csv("target_patents.csv")
patents <- na.omit(patents)
names(patents) <- tolower(gsub("[^[:alnum:]]", "_", names(patents)))
head(patents)
tagged <- udpipe_tag(sample(patents$abstract, 1), udmodel)
tagged$tagged_doc_df
Use scroll to zoom in and out (needed on some long sentences).
rand_sent_id <- sample(tagged$tagged_doc_df$sentence_id, 1)
tagged$tagged_doc_df |>
filter(sentence_id == rand_sent_id) |>
plot_sentence_tree()
Use scroll to zoom in and out (labels disappear at certain level of zoomed out).
plot_doc_graph(tagged$tagged_doc_df, tagged$vocab_df, directed = TRUE)
Static igraph version
plot_doc_igraph(tagged$tagged_doc_df, tagged$vocab_df, directed = TRUE)
Smaller sample of documents
docs <- patents$abstract[1:5]
tagged <- udpipe_tag(docs, udmodel)
plot_doc_graph(tagged$tagged_doc_df, tagged$vocab_df, directed = TRUE)
plot_doc_igraph(tagged$tagged_doc_df, tagged$vocab_df, directed = TRUE)
Wow, very useful visual!
docs <- patents$abstract
tagged <- udpipe_tag(docs, udmodel)
plot_doc_igraph(tagged$tagged_doc_df, tagged$vocab_df, directed = TRUE)
docs <- patents$abstract
FULL_TAGGED <- udpipe_tag(docs, udmodel)
FULL_GRAPH <- doc2graph(FULL_TAGGED$tagged_doc_df, FULL_TAGGED$vocab_df, directed = TRUE)
cat("Number of nodes: ", vcount(FULL_GRAPH$igraph_obj), "\n")
## Number of nodes: 2041
cat("Number of edges: ", ecount(FULL_GRAPH$igraph_obj), "\n")
## Number of edges: 15152
cat("Average degree: ", mean(degree(FULL_GRAPH$igraph_obj)), "\n")
## Average degree: 14.84762
cat("Graph density: ", edge_density(FULL_GRAPH$igraph_obj), "\n")
## Graph density: 0.003639123
cat("Global transitivity: ", transitivity(FULL_GRAPH$igraph_obj, type = "global"), "\n")
## Global transitivity: 0.1035146
hist(degree(FULL_GRAPH$igraph_obj), main = "Degree distribution")
abline(v = mean(degree(FULL_GRAPH$igraph_obj)), col = "red")
Similar to word level TextRank (https://aclanthology.org/W04-3252.pdf) approach - but they build edges based on sliding window. TextRank sentence level approach builds graph off of sentences TF-IDF similarity.
FULL_GRAPH$nodes_df$pagerank <- page_rank(FULL_GRAPH$igraph_obj)$vector
# Remove some uninteresting POS
include_pos <- c("NOUN", "VERB", "ADJ", "PROPN")
exclude_pos <- setdiff(unique(FULL_GRAPH$nodes_df$upos), include_pos)
FULL_GRAPH$nodes_df |>
arrange(-pagerank) |>
filter(upos %in% include_pos) |>
select(c("label", "upos", "pagerank")) |>
head(10)
It should be a more semantically rich graph than TextRank due to the connections between words having direct semantic relationships rather than just co-occurrence.
It should be a less dense graph than TextRank due to the more direct connections.
Does a dense graph beat a sparser more semantically rich graph? (“rich” is kinda arbitrary though)
FULL_TAGGED$tagged_doc_df <- FULL_TAGGED$tagged_doc_df |>
arrange(doc_id, sentence_id, token_id)
# TextRank keyword extraction
keywords <- textrank_keywords(
FULL_TAGGED$tagged_doc_df$lemma,
relevant = FULL_TAGGED$tagged_doc_df$upos %in% include_pos,
ngram_max = 3,
sep = " "
)
textrank_df <- data.frame(
lemma = names(keywords$pagerank$vector),
textrank = keywords$pagerank$vector
)
rownames(textrank_df) <- NULL
textrank_df |>
arrange(-textrank) |>
head(10)
compare_df <- textrank_df |>
inner_join(FULL_GRAPH$nodes_df, by = "lemma") |>
filter(upos %in% include_pos) |>
mutate(residual = textrank - pagerank) |>
select(lemma, upos, textrank, pagerank, residual) |>
arrange(desc(abs(residual)))
compare_df |>
mutate(label = ifelse(min_rank(desc(residual)) <= 10, lemma, NA)) |>
mutate(label = ifelse(min_rank(residual) <= 3, lemma, label)) |>
ggplot(aes(x = textrank, y = pagerank)) +
# ggplot(aes(x = scale(textrank), y = scale(pagerank))) +
geom_point() +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "#bbb") +
geom_text(aes(label = label), hjust = 0, vjust = -0.2) +
labs(
title = "Comparison of TextRank and dependency graph PageRank",
subtitle = paste0("Pearson correlation = ", round(cor(compare_df$textrank, compare_df$pagerank), 3), "; dashed line at x = y"),
x = "TextRank (from co-occurrence)",
y = "PageRank (from dependency graph)"
) +
xlim(c(0, 0.022))
compare_df |>
arrange(residual) |>
head(5)
compare_df |>
arrange(-residual) |>
head(5)
Explore node2vec embeddings (briefly).
start <- Sys.time()
FORCE_RETRAIN <- TRUE
NUM_WALKS <- 8
WALK_LENGTH <- 32
OUTPUT_DIM <- 128
if (FORCE_RETRAIN | !file.exists("node2vec_embeddings.csv")) {
# convert edges from ids to labels
edges_df <- FULL_GRAPH$edges_df |>
left_join(FULL_GRAPH$nodes_df, by = c("from" = "id")) |>
select(from = label, to, weight) |>
left_join(FULL_GRAPH$nodes_df, by = c("to" = "id")) |>
select(from, to = label, weight) |>
mutate(from = gsub(" \\([A-Z]+, degree=\\d+\\)", "", from)) |>
mutate(to = gsub(" \\([A-Z]+, degree=\\d+\\)", "", to)) |>
as.data.frame()
embeddings <- node2vecR(
data = edges_df,
num_walks = NUM_WALKS, # default is 10
walk_length = WALK_LENGTH, # default is 80
directed = TRUE,
dim = OUTPUT_DIM # default is 128
)
# to csv
write.csv(embeddings, "node2vec_embeddings.csv", row.names = TRUE)
} else {
embeddings <- read.csv("node2vec_embeddings.csv", row.names = 1)
}
end <- Sys.time()
end - start
## Time difference of 24.52126 mins
Dimension reduction with UMAP.
umap_result <- umap(embeddings, n_neighbors = 15, min_dist = 0.1, metric = "cosine")
umap_df <- as.data.frame(umap_result) |>
rownames_to_column("term") |>
rename(x = V1, y = V2) |>
left_join(FULL_GRAPH$nodes_df, by = c("term" = "lemma"))
p <- ggplot(umap_df, aes(x = x, y = y, text = term, color = upos)) +
geom_point() +
theme_minimal() +
labs(
title = "UMAP Projection of Node2Vec Embeddings",
subtitle = "Embeddings from dependency graph projected with UMAP (cosine distance)\nHover to see individual terms"
)
ggplotly(p, tooltip = "text")
cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
most_similar <- function(word, n = 5) {
vec <- embeddings[word, , drop = FALSE]
sims <- apply(embeddings, 1, cos_sim, y = vec)
sort(sims, decreasing = TRUE)[2:(n + 1)]
}
# some of top pagerank dep graph words
query_me <- c("frame", "enclose", "fail", "couple")
lapply(query_me, most_similar, n = 3)
## [1] "frame"
## [[1]]
## expanded wiring l
## 0.3210215 0.2818157 0.2682426
##
## [1] "enclose"
## [[2]]
## remite request where
## 0.2836164 0.2818911 0.2727288
##
## [1] "fail"
## [[3]]
## axis furniture instruction
## 0.3166027 0.2943973 0.2789573
##
## [1] "couple"
## [[4]]
## information demarcate deck
## 0.3159187 0.3088126 0.2813700
Kind of expecting a lot from this much data
analogy <- function(a, b, c, top_n = 3) {
cat("Analogy: '", a, "' is to '", b, "' as '", c, "' is to ?\n")
vec <- embeddings[b, ] - embeddings[a, ] + embeddings[c, ]
sims <- apply(embeddings, 1, cos_sim, y = vec)
sort(sims, decreasing = TRUE)[1:(top_n + 1)]
}
analogy("barcode", "product", "compartment")
## Analogy: ' barcode ' is to ' product ' as ' compartment ' is to ?
## compartment product promote span
## 0.5847047 0.5719872 0.2646405 0.2524163
Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 855-864).
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Levy, O., & Goldberg, Y. (2014, June). Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 302-308).
McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
Mihalcea, R., & Tarau, P. (2004, July). Textrank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing (pp. 404-411).